We seek to answer four outstanding questions in this writeup:

  1. Is the Fisher exact test calibrated on the negative control data? If so, can we draw conclusions about why the other methods might be miscalibrated?
  2. Why does the Fisher exact test yield a large fraction of p-values equal to exactly 1?
  3. Why are the methods poorly calibrated on the Schraivogel perturb-seq data?
  4. What QC threshold should we be using? If we adopt a more stringent QC threshold, do the methods remain miscalibrated?

Question 1: Is the Fisher exact test calibrated on the negative control data?

I applied Fisher’s exact test to the negative control data within the framework of the undercover pipeline (using group size = 1). Below, I plot the resulting p-values on a negative log 10 transformed scale.
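For concreteness, here is a minimal sketch of how Fisher's exact test can be applied to a single gRNA-gene pair in this setting. The function name and the toy data are my own illustration, not the pipeline's actual code; the sketch assumes the pipeline's convention of dichotomizing expression at zero.

```python
import numpy as np
from scipy.stats import fisher_exact

def fisher_pair_pvalue(perturbed, expression):
    """Fisher exact p-value for one gRNA-gene pair.

    perturbed: boolean array, True if the cell received the (undercover) gRNA.
    expression: integer array of UMI counts for the gene, dichotomized at zero.
    """
    perturbed = np.asarray(perturbed)
    expressed = np.asarray(expression) > 0
    # 2x2 contingency table: perturbation status vs. nonzero expression
    table = np.array([
        [np.sum(perturbed & expressed), np.sum(perturbed & ~expressed)],
        [np.sum(~perturbed & expressed), np.sum(~perturbed & ~expressed)],
    ])
    _, p = fisher_exact(table, alternative="two-sided")
    return p

# toy example: 200 cells, no true effect
rng = np.random.default_rng(0)
perturbed = rng.random(200) < 0.1
expression = rng.poisson(0.5, size=200)
p = fisher_pair_pvalue(perturbed, expression)
```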

Amazingly, the p-values are calibrated far into the tail of the distribution on all of the datasets (with the exceptions of the Papalexi protein data and the Schraivogel perturb-seq data; more on those below). Hence, confounding likely is negligible: if confounding were non-negligible, then the Fisher exact test, which is a test of marginal rather than conditional independence, would be miscalibrated. Seurat DE and SCEPTRE (not depicted) also are tests of marginal independence, yet both are miscalibrated on the negative control data. The miscalibration of Seurat DE and SCEPTRE therefore likely is not the result of confounding; rather, the assumptions that these methods make about the null distribution of the test statistic (namely, that it is Gaussian distributed or skew-normal distributed, respectively) likely are wrong for some fraction of the pairs.

Next, I plot the negative control p-values on an untransformed scale.

A nontrivial fraction of the p-values is exactly equal to one, causing the bulk of the distribution to be conservative. I investigate reasons for this in the next section.

Finally, I applied Fisher’s exact test to the (grouped) positive control data. The number of discoveries that Fisher’s exact test made after a Bonferroni correction is plotted below (the results of the other methods are plotted as well for reference). Clearly, Fisher’s exact test is quite capable of making discoveries. However, it does appear to be slightly less powerful than the other methods (aside from MIMOSCA); this might be a consequence of the fact that Fisher’s exact test is the only calibrated method among those plotted.
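For reference, the computation behind the discovery counts is simple; the helper below (name mine) sketches counting discoveries after a Bonferroni correction.

```python
import numpy as np

def n_bonferroni_discoveries(pvalues, alpha=0.05):
    """Count discoveries after a Bonferroni correction: reject a pair when
    its p-value falls at or below alpha divided by the number of tests."""
    pvalues = np.asarray(pvalues)
    return int(np.sum(pvalues <= alpha / len(pvalues)))
```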

Some questions remain regarding the application of Fisher’s exact test to single-cell CRISPR screen data. Nonetheless, if I had to advise a biologist tomorrow about what method she should use to analyze her single-cell CRISPR screen data, I would recommend Fisher’s exact test.

Question 2: Why does the Fisher exact test yield a large fraction of p-values exactly equal to one?

The Fisher exact test yields a p-value exactly equal to one for a large fraction of the negative control pairs. This likely is due to the fact that, for highly expressed features (e.g., protein data), the contingency table is extremely unbalanced. We might ameliorate this problem by dichotomizing the gene expression counts at their median rather than at zero.
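A small simulation illustrates the mechanism and the proposed fix. The helper below is my own sketch, not pipeline code: for a highly expressed feature dichotomized at zero, one column of the contingency table is empty, so the Fisher p-value is exactly 1; dichotomizing at the median restores a balanced table.

```python
import numpy as np
from scipy.stats import fisher_exact

def fisher_pvalue(perturbed, expressed):
    """Fisher exact p-value for a 2x2 table of perturbation vs. expression."""
    table = [
        [np.sum(perturbed & expressed), np.sum(perturbed & ~expressed)],
        [np.sum(~perturbed & expressed), np.sum(~perturbed & ~expressed)],
    ]
    return fisher_exact(table)[1]

rng = np.random.default_rng(0)
perturbed = rng.random(1000) < 0.1
counts = rng.poisson(50, size=1000)  # a highly expressed feature (e.g., a protein)

# Dichotomizing at zero: essentially every cell is "expressed", so the
# "not expressed" column of the table is empty and the p-value is exactly 1.
p_zero = fisher_pvalue(perturbed, counts > 0)

# Dichotomizing at the median instead balances the two columns.
p_median = fisher_pvalue(perturbed, counts > np.median(counts))
```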

Question 3: Why are the methods miscalibrated on the Schraivogel perturb-seq data?

All methods (with the exception of MIMOSCA) are miscalibrated on the Schraivogel perturb-seq data. Why is this the case?

One possible explanation for the miscalibration that we observe is confounding. Gene carried out an analysis to assess the extent to which the perturbation indicators are confounded by the technical factors across datasets. Gene did not find any evidence of confounding on the Schraivogel perturb-seq data. Thus, confounding likely is not the culprit for the problems that we see here (unless, of course, there is unmeasured confounding).

I hypothesized that variability among the NTCs might be responsible for the miscalibration that we observe. If some NTCs actually have an effect (i.e., if some NTCs are not true NTCs), and if others have no effect, then the undercover p-values contain some amount of signal. To investigate this hypothesis, I carried out an undercover analysis in which I randomly partitioned the NTCs into two equally sized groups of “control” and “undercover” gRNAs, “averaging out” and thus mitigating the effect of any non-true NTC(s). (In other words, I ran the undercover gRNA pipeline, setting the group size equal to half the total number of NTCs.)
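The partitioning step can be sketched as follows; the function name and gRNA labels are illustrative, not the pipeline's actual code.

```python
import numpy as np

def partition_ntcs(ntc_ids, rng):
    """Randomly split the NTC gRNAs into equally sized "undercover" and
    "control" groups, averaging out the effect of any non-true NTC."""
    ntc_ids = np.asarray(ntc_ids)
    perm = rng.permutation(len(ntc_ids))
    half = len(ntc_ids) // 2
    return ntc_ids[perm[:half]], ntc_ids[perm[half:]]

# toy example: ten NTC gRNAs, so the undercover group size is five
ntcs = [f"NTC_{i}" for i in range(10)]
undercover, control = partition_ntcs(ntcs, np.random.default_rng(1))
```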

The calibration of the p-values improves considerably. However, the methods still are miscalibrated. I suspect that variability among the NTCs is at least partially responsible, but the full story still is not clear.

Question 4: What is a meaningful QC threshold to use?

We must determine a reasonable QC threshold to use in our analysis. If we choose the QC threshold too liberally (i.e., if we allow too many pairs to pass through the QC filter), then we risk putting ourselves in an “uninteresting” region of the problem space in which we are under-powered to make discoveries for pairs with low effective sample sizes. This not only wastes compute but also reduces power at the multiple testing correction step. By contrast, if we choose our QC threshold too stringently (i.e., if we allow too few pairs to pass through the QC filter), then we risk throwing out biologically interesting pairs for which we might otherwise have made discoveries. An additional complication is that the “effective sample size” of a given pair (i.e., the number of cells with nonzero gene expression) is a function of both the gene and the gRNA. Thus, QC in single-cell CRISPR screen analysis must operate at the level of the gene-gRNA pair (instead of at the level of the gene and the gRNA separately).
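As a sketch, pair-level QC might look like the following. The function names are hypothetical; the effective sample size is computed over whichever cells enter a given pair's test, which is how it comes to depend on both the gene and the gRNA.

```python
import numpy as np

def effective_sample_size(expression):
    """Effective sample size of a gene-gRNA pair: the number of cells with
    nonzero expression of the gene, computed over the cells that enter the
    pair's test (so it depends on the gRNA through the cells included)."""
    return int(np.sum(np.asarray(expression) > 0))

def pair_level_qc(pair_ess, threshold):
    """Pair-level QC: keep only the gene-gRNA pairs whose effective sample
    size meets the threshold. pair_ess maps (gene, gRNA) -> ESS."""
    return {pair for pair, ess in pair_ess.items() if ess >= threshold}
```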

A reasonable strategy for selecting the QC threshold is as follows: locate the minimum effective sample size such that a well-calibrated method (e.g., Fisher’s exact test) consistently rejects positive control pairs whose effective sample size is (approximately) equal to that minimum. To determine this threshold empirically, I applied Fisher’s exact test, SCEPTRE, the Weissman Method, and Seurat DE to the singleton (i.e., ungrouped) positive control data. (I did not apply MIMOSCA or the Schraivogel method, as these methods are computationally expensive. Furthermore, I did not apply the Liscovitch method, as it is badly miscalibrated and thus uninformative.) First, I plot the number of rejections that each method makes on each dataset.

Seurat DE and SCEPTRE appear to be the most powerful methods in general, with Seurat DE having a slight advantage over SCEPTRE. Next, I plot a histogram of the “effective sample size” of each pair, faceted by dataset.

Effective sample sizes for the (singleton) positive control pairs range from zero up to several thousand in most cases (and up to 10,000+ on the Schraivogel ground truth perturb-seq data).

Next, I plot the Fisher exact p-value vs. the effective sample size of each pair on each of the positive control datasets. Pairs with a p-value of above 0.001 are colored in red and those below 0.001 are colored in blue. (The number 0.001 was chosen somewhat arbitrarily; basically, 0.001 is a proxy for the threshold that a p-value might have to surpass for the pair to be rejected.) The vertical black line indicates the pair with the smallest effective sample size such that its p-value is less than 0.001. Across datasets, we see that the minimum effective sample size required to reject a pair at the 0.001 level is about 500.
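The threshold-locating step (the vertical black line) can be sketched as follows; the helper is hypothetical, not pipeline code. Among the pairs rejected at the 0.001 level, we take the smallest effective sample size.

```python
import numpy as np

def min_ess_to_reject(ess, pvalues, alpha=0.001):
    """Smallest effective sample size among pairs with p-value below alpha,
    i.e., the location of the vertical black line; None if no pair rejects."""
    ess = np.asarray(ess)
    rejected = np.asarray(pvalues) < alpha
    return int(ess[rejected].min()) if rejected.any() else None
```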

We show the same plot for Seurat DE. The results are similar to those of the Fisher exact test.

The apparent minimum effective sample size required to reject at the 0.001 level (i.e., $\approx$ 250–750) is surprisingly large.

The impact of stringent QC on calibration

As we apply increasingly stringent QC, the problem becomes easier, and thus calibration should improve. We test that hypothesis here. First, we plot the results that we obtain using the default QC (i.e., filtering for genes that are expressed in a fraction > 0.005 of cells).
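The default gene-level QC described above can be sketched as follows (a toy dense-matrix version; the function name is mine, and the real expression matrices would be sparse).

```python
import numpy as np

def default_gene_qc(expression_matrix, min_frac=0.005):
    """Default QC: keep genes that are expressed (nonzero) in more than
    min_frac of cells. expression_matrix is a cells x genes count array."""
    frac_expressed = np.mean(expression_matrix > 0, axis=0)
    return frac_expressed > min_frac

# toy example: gene 0 is never expressed, gene 1 is expressed in 3/4 of cells
X = np.array([[0, 1], [0, 2], [0, 0], [0, 5]])
keep = default_gene_qc(X)
```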

Next, we remove all pairs with an effective sample size of less than 250. This amounts to removing 30% of pairs. The calibration improves, especially on the fairly sparse Frangieh data.

Next, we restrict our attention to pairs that have an effective sample size exceeding 750. This amounts to removing 50% of (the original) pairs. Calibration improves further.

For better or for worse, when we restrict our attention to more “meaningful” regions of the parameter space, Seurat DE becomes a formidable competitor!

Next steps and open questions